[LLaDA 2.0-Uni] Add LLaDA 2.0-Uni multimodal discrete diffusion pipeline #13686

Open
ChinChyi wants to merge 2 commits into huggingface:main from ChinChyi:add-unillada-pipeline

@ChinChyi ChinChyi commented May 6, 2026

What does this PR do?

Adds support for LLaDA 2.0-Uni, a unified multimodal discrete diffusion language model that supports text understanding, image understanding, and image generation in a single framework.

Paper: LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

New Components

  • LLaDA2UniImageTransformer2DModel — Image diffusion transformer for decoding VQ tokens to images
  • UniLLaDaPipeline — Unified pipeline supporting three modes:
    • Text-to-image generation
    • Image understanding (VQA, captioning)
    • Image editing
  • LLaDA2UniFlowMatchEulerScheduler — Flow matching scheduler with Euler ODE integration
  • Image tokenizer utilities — SigVQ-based image encoding/decoding
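
For reviewers unfamiliar with flow matching, the Euler integration that a scheduler such as LLaDA2UniFlowMatchEulerScheduler performs can be sketched roughly as below. This is a toy illustration in plain Python, not the scheduler's code: `euler_integrate`, `make_linear_velocity`, and the 0-to-1 time convention are assumptions made for the example, and the learned velocity model is replaced by a closed-form straight-line field.

```python
from typing import Callable, List

Vector = List[float]


def euler_integrate(
    velocity: Callable[[Vector, float], Vector],
    x: Vector,
    num_steps: int,
    t_start: float = 0.0,
    t_end: float = 1.0,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t_start to t_end with fixed Euler steps."""
    dt = (t_end - t_start) / num_steps
    for i in range(num_steps):
        t = t_start + i * dt
        v = velocity(x, t)
        # Euler update: move along the current velocity for one step of size dt.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_linear_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a learned model: velocity of the straight path toward `target`."""

    def velocity(x: Vector, t: float) -> Vector:
        # On the path x_t = (1 - t) * x_0 + t * target, the velocity is (target - x_t) / (1 - t).
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity


noise = [0.0, 2.0]      # hypothetical starting sample
target = [1.0, -1.0]    # hypothetical data point
field = make_linear_velocity(target)
sample_8 = euler_integrate(field, noise, num_steps=8)
sample_50 = euler_integrate(field, noise, num_steps=50)
```

Because the toy path is a straight line, both step counts land on the target. With a learned, curved velocity field the number of steps trades speed against integration accuracy.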

Key Features

  • Multimodal capabilities: Single model handles both vision and language tasks
  • Discrete diffusion: Block-wise iterative refinement for token generation
  • FP8 quantization support: Efficient inference with quantized weights
  • Flexible decoding: Supports both quality mode (50 steps) and turbo mode (8 steps)

Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers import UniLLaDaPipeline, BlockRefinementScheduler
from diffusers.pipelines.unillada.image_tokenizer import ImageTokenizer

model_id = "inclusionAI/LLaDA2.0-Uni"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
scheduler = BlockRefinementScheduler()
image_tokenizer = ImageTokenizer(model_path=model_id)

pipe = UniLLaDaPipeline(
    transformer=model,
    tokenizer=tokenizer,
    scheduler=scheduler,
    image_tokenizer=image_tokenizer,
)

# Text-to-Image
result = pipe(prompt="A cat sitting on a windowsill at sunset")
result.images[0].save("output.png")

# Image Understanding
from PIL import Image
img = Image.open("photo.jpg")
result = pipe(image=img, question="Describe this image in detail.")
print(result.text)

# Image Editing
result = pipe(image=img, instruction="Change the background to a beach.")
result.images[0].save("edited.png")

Testing

  • Added unit tests in tests/pipelines/unillada/test_unillada.py
  • Tests cover all three modes (generation, understanding, editing)
  • Mock components for CI compatibility

Model Weights

Official weights available at: https://huggingface.co/inclusionAI/LLaDA2.0-Uni

Before submitting

  • Did you read the contributor guideline?
  • Did you read our philosophy doc?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@yiyixuxu @a-r-r-o-w @DN6

Add UniLLaDA pipeline supporting text-to-image, image understanding,
and image editing via block-wise iterative discrete diffusion.

New components:
- UniLLaDaPipeline: main pipeline (DiffusionPipeline subclass)
- LLaDA2UniImageTransformer2DModel: image transformer model
- LLaDA2UniFlowMatchEulerScheduler: flow matching scheduler
- ImageTokenizer: VQ image encoder helper
- Documentation and tests
@github-actions github-actions Bot added the size/L (PR with diff > 200 LOC), documentation (Improvements or additions to documentation), models, tests, utils, pipelines, and schedulers labels and removed the size/L (PR with diff > 200 LOC) label May 6, 2026
@dg845 dg845 requested review from dg845 and yiyixuxu May 14, 2026 08:47
@github-actions github-actions Bot added the size/L PR with diff > 200 LOC label May 15, 2026
return torch.cat(result, dim=-1)


class LLaDA2UniImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
Collaborator

Is LLaDA2UniImageTransformer2DModel intended to be used as part of the UniLLaDA pipeline? I see that the transformer loaded by the pipeline is a remotely implemented transformers model (LLaDA2MoeModelLM in modeling_llada2uni_moe.py), and this transformer doesn't appear to be used anywhere.

Comment on lines +41 to +43
>>> model = AutoModelForCausalLM.from_pretrained(
...     model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
... )
Collaborator

I would prefer a diffusers-native (or transformers-native) implementation of the DiT model so that we don't need trust_remote_code=True.

Collaborator

It's really OK to have trust_remote_code here. Ideally the model would be transformers-native, but that is up to them.
We are not going to host transformer models in diffusers.

        return_dict: bool,
    ) -> UniLLaDaPipelineOutput | tuple:
        """Text-to-image generation."""
        result = self.transformer.generate_image(
Collaborator

@dg845 dg845 May 15, 2026

I think the denoising loop should be implemented in UniLLaDaPipeline.__call__ using a scheduler (such as BlockRefinementScheduler), which is the standard diffusers design, rather than in transformer methods like generate_image.
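
To make the suggested split concrete, here is a minimal sketch of a denoising loop owned by the pipeline, with the per-step update delegated to a scheduler. Everything in it is hypothetical and written in plain Python for illustration: `ToyBlockScheduler`, `run_denoising_loop`, `toy_model`, and the `MASK` id are stand-ins, and the `step` signature is not claimed to match the actual API of BlockRefinementScheduler.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

MASK = -1  # hypothetical id marking a still-masked position
Prediction = Tuple[int, float]  # (predicted token id, confidence)


@dataclass
class ToyBlockScheduler:
    """Hypothetical scheduler: decides which masked positions to commit at each step."""

    num_steps: int

    def step(self, tokens: List[int], predictions: List[Prediction], step_index: int) -> List[int]:
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            return list(tokens)
        # Spread the remaining masked positions evenly over the remaining steps.
        steps_left = self.num_steps - step_index
        num_commit = math.ceil(len(masked) / steps_left)
        # Commit the most confident predictions first; the rest stay masked.
        ranked = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        updated = list(tokens)
        for i in ranked[:num_commit]:
            updated[i] = predictions[i][0]
        return updated


def run_denoising_loop(
    model: Callable[[List[int]], List[Prediction]],
    scheduler: ToyBlockScheduler,
    prompt: List[int],
    block_len: int,
) -> List[int]:
    """Pipeline-side loop: the model only predicts, the scheduler only updates."""
    tokens = list(prompt) + [MASK] * block_len
    for step_index in range(scheduler.num_steps):
        predictions = model(tokens)
        tokens = scheduler.step(tokens, predictions, step_index)
    return tokens


def toy_model(tokens: List[int]) -> List[Prediction]:
    # Deterministic stand-in for a transformer forward pass.
    return [(100 + i, 1.0 / (1 + i)) for i in range(len(tokens))]


result = run_denoising_loop(toy_model, ToyBlockScheduler(num_steps=2), prompt=[1, 2], block_len=4)
```

With this shape, swapping the unmasking strategy means swapping the scheduler, and the model exposes only a forward pass rather than a method like generate_image.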

# ============================================================


class ImageTokenizer:
Collaborator

Suggested change
- class ImageTokenizer:
+ class ImageTokenizer(ModelMixin, ConfigMixin):

I think ImageTokenizer should inherit from ModelMixin and ConfigMixin (which is standard for diffusers models) so that saving and loading can be handled in the normal diffusers way, rather than needing to be implemented separately in __init__ below.

OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711]


class ImagePreprocessor:
Collaborator

I think we should refactor the image preprocessing logic in ImagePreprocessor into a dedicated VaeImageProcessor subclass that lives in its own file (e.g. image_processor.py). See for example JoyImageEditImageProcessor as a reference:

class JoyImageEditImageProcessor(VaeImageProcessor):

Comment on lines +235 to +240
attn_impl = getattr(self.config, "_attn_implementation", "eager")
if attn_impl != "eager" and attn_impl in ALL_ATTENTION_FUNCTIONS:
    attention_interface = ALL_ATTENTION_FUNCTIONS[attn_impl]
    if "flash" in attn_impl:
        max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max()
        attn_output, _ = attention_interface(
Collaborator

We should consider using dispatch_attention_fn instead as it handles the attention backends used here, such as Flash Attention (including flash_varlen) and torch native SDPA. For reference, see the attention backend docs.
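
For context on the excerpt above: in variable-length attention, `cu_seqlens` conventionally holds the cumulative boundaries of sequences packed into one tensor, so adjacent differences give the per-sequence lengths. A plain-Python sketch of that bookkeeping follows. The helper names are made up for illustration, and nothing here calls dispatch_attention_fn or any attention kernel.

```python
from itertools import accumulate
from typing import List, Sequence


def build_cu_seqlens(lengths: Sequence[int]) -> List[int]:
    """Cumulative boundaries: sequence i occupies packed[cu[i]:cu[i + 1]]."""
    return [0] + list(accumulate(lengths))


def max_seqlen(cu_seqlens: Sequence[int]) -> int:
    """Longest packed sequence, mirroring (cu_seqlens[1:] - cu_seqlens[:-1]).max()."""
    return max(end - start for start, end in zip(cu_seqlens, cu_seqlens[1:]))


def split_packed(packed: Sequence[str], cu_seqlens: Sequence[int]) -> List[List[str]]:
    """Recover the individual sequences from the packed layout."""
    return [list(packed[start:end]) for start, end in zip(cu_seqlens, cu_seqlens[1:])]


cu = build_cu_seqlens([3, 1, 4])           # three sequences packed back to back
longest = max_seqlen(cu)
chunks = split_packed(list("abcdefgh"), cu)
```

A varlen attention backend uses these boundaries to keep tokens from attending across sequence borders, which is why the maximum length has to be computed before the call.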

return self.net(x)


class SigVQ(nn.Module):
Collaborator

Is the SigVQ model intended to be used as part of the UniLLaDA pipeline? I don't see it being used anywhere.

import PIL.Image


def generate_crop_size_list(
Collaborator

Similar to the ImagePreprocessor comment above (#13686 (comment)), I think we should refactor the image preprocessing logic here into a dedicated VaeImageProcessor subclass (possibly combined with the one from image_tokenizer.py).
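
As a reference point for that refactor, helpers named like generate_crop_size_list usually implement aspect-ratio bucketing: enumerate the width and height pairs that fit a patch budget, then snap each image to the nearest bucket. The sketch below is an assumption about that general technique, not a copy of the PR's function. The signature, the `max_ratio` parameter, and `closest_crop_size` are hypothetical.

```python
import math
from typing import List, Tuple

Size = Tuple[int, int]  # (width, height) in pixels


def generate_crop_size_list(num_patches: int, patch_size: int, max_ratio: float = 4.0) -> List[Size]:
    """Enumerate crop sizes whose patch count fits the budget and whose aspect ratio is bounded."""
    if max_ratio < 1.0:
        raise ValueError("max_ratio must be at least 1.0")
    sizes: List[Size] = []
    wp, hp = num_patches, 1  # width and height measured in patches
    while wp > 0:
        if max(wp, hp) / min(wp, hp) <= max_ratio:
            sizes.append((wp * patch_size, hp * patch_size))
        # Grow the height while the budget allows, otherwise shrink the width.
        if (hp + 1) * wp <= num_patches:
            hp += 1
        else:
            wp -= 1
    return sizes


def closest_crop_size(width: int, height: int, sizes: List[Size]) -> Size:
    """Pick the bucket whose aspect ratio is nearest to the image's, compared in log space."""
    target = math.log(width / height)
    return min(sizes, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


buckets = generate_crop_size_list(num_patches=4, patch_size=16)
square = closest_crop_size(100, 100, buckets)
wide = closest_crop_size(200, 100, buckets)
```

In a VaeImageProcessor subclass, logic of this kind would sit next to resizing and normalization, so the pipeline only hands over a PIL image and receives a tensor.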

Collaborator

@dg845 dg845 left a comment

Thanks for the PR! I left an initial design review :).

@ChinChyi ChinChyi changed the title from "[UniLLaDA] Add UniLLaDA multimodal discrete diffusion pipeline" to "[LLaDA 2.0-Uni] Add LLaDA 2.0-Uni multimodal discrete diffusion pipeline" May 18, 2026
Labels

documentation, models, pipelines, schedulers, size/L, tests, utils

3 participants